How much homophony should we expect for a given wordform, as a function of its phonotactics and word length? Is this expected homophony more or less than the observed homophony for that wordform? And critically: does this Homophony Delta vary across wordforms as a function of Wordform Frequency?
That is, if there is a pressure against homophony in real lexica (Trott & Bergen, 2020), that pressure might manifest unevenly across the lexicon. Based on previous work, it is possible that the pressure against homophony is actually weakest for long wordforms, as well as phonotactically implausible wordforms. Why would this be the case? It seems counterintuitive to suppose that a lexicon designed for efficiency would not optimize the use of its wordforms by concentrating homophones among the shortest, most phonotactically plausible wordforms.
One explanation is that word length is correlated with frequency. The negative selection pressure might be strongest for wordforms that, if ambiguous, would require very frequent disambiguation. If this interpretation is correct, Homophony Delta should correlate negatively with Frequency.
Again, this prediction appears counterintuitive, given the extensive evidence that wordform frequency is positively correlated with ambiguity, whether measured in terms of number of homophones or number of senses (Zipf, 1945; Piantadosi et al., 2012). But this observation would ultimately still be consistent with a mechanism selecting against homophones in the most frequent wordforms. The point is not that frequent wordforms should have fewer homophones in absolute terms; rather, frequent wordforms should have fewer homophones than one would expect on account of their phonotactics and length, while this gap, or Homophony Delta, should be smaller (or even inverted) for infrequent wordforms.
Given \(M\) meanings to express, and \(W\) wordforms with which to express them, each associated with some probability \(p_i\), how many meanings should each wordform \(w_i\) carry as a function of its normalized probability \(p_i\)?
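Under a simple allocation model, in which each of the \(M\) meanings at a given word length is independently assigned to wordform \(w_i\) with probability \(p_i\), the expected number of lexical entries carried by \(w_i\) is \(M p_i\). Since one of those entries is the wordform's own meaning, the expected number of homophones is

\[
E[k_i] = M p_i - 1,
\]

and the Homophony Delta for \(w_i\) is its observed homophone count minus this expectation. This is the quantity computed below.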
We also define several helper functions to simplify the data processing pipeline.
First, we need a function that reads in an aggregated lexicon (i.e., one row per unique wordform, collapsed across homophones); this function will then compute the correct number of meanings \(M\) per word length, and merge that information back into the aggregated lexicon.
read_data = function(agg_path) {
  # Read in aggregated data (one row per unique wordform)
  df_agg = read_csv(agg_path)
  # Count the number of lexical *entries* at each word length:
  # each wordform contributes (num_homophones + 1) entries.
  df_total_counts = df_agg %>%
    mutate(k_actual = num_homophones + 1) %>%
    group_by(num_sylls_est) %>%
    summarise(M = sum(k_actual))
  # Merge the per-length totals back into the aggregated data
  df_agg = merge(df_agg, df_total_counts, by = "num_sylls_est")
  df_agg
}
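The per-length computation of \(M\) can be checked on a small hypothetical example (toy values, not drawn from the lexicon): a 1-syllable wordform with 2 homophones contributes 3 entries, and so on.

```r
library(dplyr)

# Toy aggregated lexicon (hypothetical values)
toy = tibble(
  num_sylls_est  = c(1, 1, 2),
  num_homophones = c(2, 0, 1)
)
toy_counts = toy %>%
  mutate(k_actual = num_homophones + 1) %>%
  group_by(num_sylls_est) %>%
  summarise(M = sum(k_actual))
toy_counts
# 1-syllable wordforms contribute (2+1) + (0+1) = 4 entries; 2-syllable, 1+1 = 2.
```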
We then define a function to normalize the probability of a given wordform relative to the sum of probabilities of wordforms for that length.
normalize_probabilities = function(df) {
  ### For a given dataframe, normalize probabilities "p" relative to the total
  ### probability mass of wordforms of the same length.
  ## First convert held-out log-probabilities to probabilities
  df = df %>%
    mutate(normalized_surprisal = heldout_surprisal / num_phones,
           p = 10 ^ heldout_log_prob)
  # Calculate sum(p) for each syllable-length bucket
  df_syl_sums = df %>%
    group_by(num_sylls_est) %>%
    summarise(total_prob = sum(p))
  # Merge with aggregated data, and normalize
  # (note: left_join's argument is `by`, not `on`)
  df = df %>%
    left_join(df_syl_sums, by = "num_sylls_est") %>%
    mutate(p_normalized = p / total_prob)
  df
}
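A quick sanity check on the normalization logic (again with hypothetical toy values): within each syllable-length bucket, the normalized probabilities should sum to 1.

```r
library(dplyr)

# Toy data: two length buckets with arbitrary raw probabilities
toy = tibble(
  num_sylls_est = c(1, 1, 2, 2),
  p             = c(0.02, 0.06, 0.01, 0.03)
)
toy_norm = toy %>%
  group_by(num_sylls_est) %>%
  mutate(p_normalized = p / sum(p)) %>%
  summarise(total = sum(p_normalized))
toy_norm$total  # both bucket totals should be 1
```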
We now define a function to compute the expected number of homophones \(k_i\) for a given wordform \(w_i\). This will also calculate the delta between the real and expected amount.
compute_expected_homophones = function(df) {
## Compute expected homophones "k" of a wordform by multiplying normalized probability
## "p" by the number of meanings "M".
df = df %>%
# Expected number of "successes" (entries), minus 1
mutate(k = p_normalized * M - 1) %>%
mutate(delta = num_homophones - k)
df
}
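A worked numeric example (hypothetical numbers) makes the delta concrete: with \(M = 100\) meanings at a given length and a normalized probability of 0.03, a wordform is expected to carry 3 entries, i.e. 2 homophones beyond its own meaning.

```r
# Hypothetical wordform
M = 100
p_normalized = 0.03
k = p_normalized * M - 1      # expected homophones: 3 entries minus the word itself

num_homophones = 5            # suppose we observe 5 homophones
delta = num_homophones - k    # positive delta: more homophony than expected
```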
We also define a function to run the main regression:
run_regression = function(df) {
# Initialize output
out = list()
# Build full model
model_full = lm(data = df,
delta ~ frequency + num_sylls_est + normalized_surprisal)
# Build reduced model
model_reduced = lm(data = df,
delta ~ num_sylls_est + normalized_surprisal)
# Run model comparison
comparison = anova(model_reduced, model_full)
df_comp = broom::tidy(comparison) %>%
na.omit()
# Tidy model output
df_model = broom::tidy(model_full)
# Add to list parameters
out$model = df_model
out$comparison = df_comp
out
}
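The `anova()` call above is an F-test between nested linear models: it asks whether adding `frequency` significantly reduces residual error beyond syllable count and surprisal. A minimal sketch on simulated data (arbitrary coefficients, not estimates from the lexicon):

```r
# Simulate data in which frequency genuinely contributes to delta
set.seed(1)
n = 500
frequency            = rnorm(n)
num_sylls_est        = sample(1:4, n, replace = TRUE)
normalized_surprisal = rnorm(n)
delta = -0.5 * frequency + 0.3 * num_sylls_est + rnorm(n)

m_full    = lm(delta ~ frequency + num_sylls_est + normalized_surprisal)
m_reduced = lm(delta ~ num_sylls_est + normalized_surprisal)
comparison = anova(m_reduced, m_full)
comparison$`Pr(>F)`[2]  # small p-value here: frequency carries real signal
```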
And several plotting functions:
plot_outcome = function(df_output) {
  ### Plots outcome of regression (coefficient estimates)
  df_output$model$predictor = fct_recode(
    df_output$model$term,
    "Number of Syllables" = "num_sylls_est",
    "Normalized Surprisal" = "normalized_surprisal",
    "Log(Frequency)" = "frequency"
  )
g = df_output$model %>%
ggplot(aes(x = predictor,
y = estimate)) +
geom_point() +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = estimate - 2*std.error,
ymax = estimate + 2*std.error),
width=.2,
position=position_dodge(.9)) +
labs(x = "Predictor",
y = "Estimate") +
theme_minimal()
g
}
plot_comparison = function(df) {
# Plots expected vs. actual per each wordform.
g = df %>%
ggplot(aes(x = k,
y = num_homophones)) +
geom_point(alpha = .5) +
scale_x_continuous(limits = c(-1, max(df$k))) +
scale_y_continuous(limits = c(0, max(df$k))) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
labs(x = "Expected number of homophones",
y = "Actual number of homophones") +
theme_minimal()
g
}
plot_bins = function(df, n, column, label) {
# Plots delta ~ frequency bins.
df$binned = as.numeric(ntile(pull(df, column), n = n))
g = df %>%
group_by(binned) %>%
summarise(mean_delta = mean(delta),
se_delta = sd(delta) / sqrt(n())) %>%
ggplot(aes(x = binned,
y = mean_delta)) +
geom_point(stat = "identity", size = .2) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_delta - se_delta,
ymax = mean_delta + se_delta),
width=.2,
position=position_dodge(.9)) +
labs(x = label,
y = "Delta (Real - Expected)") +
geom_smooth() +
theme_minimal()
g
}
plot_contrast = function(df, bins = 20) {
df$frequency_binned = as.numeric(ntile(df$frequency, n = bins))
g = df %>%
mutate(expected = k,
actual = num_homophones) %>%
pivot_longer(c(expected, actual), names_to = "model", values_to = "homophones") %>%
ggplot(aes(x = frequency_binned,
y = homophones,
color = model)) +
    stat_summary(fun = mean,
                 fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)),
                 fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
                 geom = "pointrange") +
labs(x = "Binned Frequency",
y = "Number of Homophones") +
theme_bw()
g
}
First, we load and process the English data.
df_english = read_data("../data/processed/english/reals/english_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## index = col_double(),
## Word = col_character(),
## CobLog = col_double(),
## CompCnt = col_double(),
## PhonDISC = col_character(),
## Class = col_character(),
## SylCnt = col_double(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double(),
## rank_num_homophones = col_double(),
## neighborhood_size = col_double(),
## heldout_log_prob = col_double(),
## heldout_surprisal = col_double()
## )
nrow(df_english)
## [1] 35107
## Now normalize probabilities and compute expected homophones per wordform
df_english = df_english %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_english %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 9 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 7706 7706
## 2 2 15247 15247
## 3 3 11379 11379
## 4 4 5316 5316
## 5 5 1694 1694
## 6 6 439 439
## 7 7 95 95
## 8 8 10 10
## 9 10 1 1
We then merge with data for frequency of each individual lemma. (We also calculate entropy over senses in the process.)
df_english_lemmas = read_delim("../data/frequency/english/lemmas.csv", delim = "\\",
quote = "")
## Parsed with column specification:
## cols(
## PhonDISC = col_character(),
## Cob = col_double()
## )
nrow(df_english_lemmas)
## [1] 52447
df_english_lemmas = df_english_lemmas %>%
mutate(freq_adjusted = Cob + 1) %>%
group_by(PhonDISC) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_english_lemmas %>%
group_by(PhonDISC) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
total_frequency = mean(total_frequency))
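The entropy computed here is the Shannon entropy over a wordform's lemma (sense) frequencies. As a sanity check on toy values: a hypothetical wordform whose three lemmas are equally frequent should have entropy log(3).

```r
library(dplyr)

# Hypothetical wordform with three equally frequent lemmas
toy = tibble(
  PhonDISC      = rep("b{t", 3),
  freq_adjusted = c(10, 10, 10)
)
toy_entropy = toy %>%
  group_by(PhonDISC) %>%
  mutate(relative_frequency = freq_adjusted / sum(freq_adjusted)) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
toy_entropy$entropy  # log(3) ~ 1.099 for a uniform 3-way distribution
```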
df_merged_english = df_english %>%
  inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_english)
## [1] 28914
df_merged_english$frequency = log10(df_merged_english$total_frequency)
Now we can visualize the outcome in a variety of ways:
df_merged_english %>%
plot_comparison()
df_merged_english %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_english %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.54 0.362 -18.1 1.43e- 72
## 2 frequency -0.494 0.0650 -7.59 3.22e- 14
## 3 num_sylls_est 0.727 0.0730 9.96 2.50e- 23
## 4 normalized_surprisal 3.58 0.131 27.3 7.43e-162
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28910 2742636. 1 5469. 57.7 3.22e-14
Directly contrast the relationships:
df_merged_english %>%
plot_contrast(bins = 20)
nrow(filter(df_merged_english, num_homophones == 1))
## [1] 4149
model_full = lm(data = filter(df_merged_english, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_merged_english, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -246.606 -0.603 1.029 2.554 6.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.4331 0.9999 -7.434 1.27e-13 ***
## entropy 1.1894 0.6551 1.816 0.069501 .
## frequency -0.9682 0.1731 -5.594 2.36e-08 ***
## num_sylls_est 0.7544 0.1955 3.860 0.000115 ***
## normalized_surprisal 4.6364 0.3064 15.133 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.805 on 4144 degrees of freedom
## Multiple R-squared: 0.07287, Adjusted R-squared: 0.07197
## F-statistic: 81.42 on 4 and 4144 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_english, num_homophones ==1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4145 321539
## 2 4144 321283 1 255.57 3.2965 0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, we load and process the Dutch data.
df_dutch = read_data("../data/processed/dutch/reals/dutch_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## PhonStrsDISC = col_character(),
## PhonCVBr = col_character(),
## PhonSylBCLX = col_character(),
## PhonStrsStDISC = col_character(),
## PhonStCVBr = col_character(),
## PhonSylStBCLX = col_character(),
## PhonolCLX = col_character(),
## PhonolCPA = col_character(),
## PhonDISC = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_dutch)
## [1] 65260
## Now normalize probabilities and compute expected homophones per wordform
df_dutch = df_dutch %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_dutch %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 9 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 4672 4672
## 2 2 20588 20588
## 3 3 26420 26420
## 4 4 11338 11338
## 5 5 3394 3394
## 6 6 846 846
## 7 7 189 189
## 8 8 28 28
## 9 9 2 2
We then merge with data for frequency of each individual lemma. (We also calculate entropy over senses in the process.)
df_dutch_lemmas = read_delim("../data/frequency/dutch/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
## PhonDISC = col_character(),
## Inl = col_double()
## )
nrow(df_dutch_lemmas)
## [1] 123926
df_dutch_lemmas = df_dutch_lemmas %>%
mutate(freq_adjusted = Inl + 1) %>%
group_by(PhonDISC) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_dutch_lemmas %>%
group_by(PhonDISC) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
total_frequency = mean(total_frequency))
df_merged_dutch = df_dutch %>%
  inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_dutch)
## [1] 63364
df_merged_dutch$frequency = log10(df_merged_dutch$total_frequency)
Now we can visualize the outcome in a variety of ways:
df_merged_dutch %>%
plot_comparison()
df_merged_dutch %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_dutch %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.02 0.385 -13.0 9.43e- 39
## 2 frequency -1.85 0.0688 -26.9 3.35e-158
## 3 num_sylls_est 0.402 0.0712 5.65 1.66e- 8
## 4 normalized_surprisal 4.04 0.165 24.4 3.48e-131
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63360 16265983. 1 185420. 722. 3.35e-158
Directly contrast the relationships:
df_merged_dutch %>%
plot_contrast(bins = 20)
nrow(filter(df_merged_dutch, num_homophones == 1))
## [1] 1630
model_full = lm(data = filter(df_merged_dutch, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_merged_dutch, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -562.49 -2.13 2.17 5.85 20.02
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.1691 3.2415 -1.286 0.199
## entropy -2.8531 2.4407 -1.169 0.243
## frequency -4.4829 0.5858 -7.653 3.36e-14 ***
## num_sylls_est -0.3627 0.6626 -0.547 0.584
## normalized_surprisal 8.4880 1.0560 8.038 1.74e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.99 on 1625 degrees of freedom
## Multiple R-squared: 0.08019, Adjusted R-squared: 0.07793
## F-statistic: 35.42 on 4 and 1625 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_dutch, num_homophones ==1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1626 786128
## 2 1625 785468 1 660.52 1.3665 0.2426
df_german = read_data("../data/processed/german/reals/german_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## index = col_double(),
## Word = col_character(),
## CompCnt = col_double(),
## PhonDISC = col_character(),
## SylCnt = col_double(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double(),
## rank_num_homophones = col_double(),
## neighborhood_size = col_double(),
## heldout_log_prob = col_double(),
## heldout_surprisal = col_double()
## )
nrow(df_german)
## [1] 50435
## Now normalize probabilities and compute expected homophones per wordform
df_german = df_german %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_german %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 10 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 2454 2454
## 2 2 13181 13181
## 3 3 19429 19429
## 4 4 10983 10983
## 5 5 3971 3971
## 6 6 1240 1240
## 7 7 328 328
## 8 8 72 72
## 9 9 16 16
## 10 10 2 2
We then merge with data for frequency of each individual lemma. (We also calculate entropy over senses in the process.)
df_german_lemmas = read_delim("../data/frequency/german/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
## PhonDISC = col_character(),
## Mann = col_double()
## )
nrow(df_german_lemmas)
## [1] 51728
df_german_lemmas = df_german_lemmas %>%
mutate(freq_adjusted = Mann + 1) %>%
group_by(PhonDISC) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_german_lemmas %>%
group_by(PhonDISC) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
total_frequency = mean(total_frequency))
df_merged_german = df_german %>%
  inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_german)
## [1] 46720
df_merged_german$frequency = log10(df_merged_german$total_frequency)
Now we can visualize the outcome in a variety of ways:
df_merged_german %>%
plot_comparison()
df_merged_german %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_german %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.59 0.428 -10.7 8.05e-27
## 2 frequency -1.28 0.0984 -13.1 6.93e-39
## 3 num_sylls_est 0.362 0.0734 4.93 8.19e- 7
## 4 normalized_surprisal 3.42 0.182 18.7 4.84e-78
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 46716 11461734. 1 41818. 170. 6.93e-39
Directly contrast the relationships:
df_merged_german %>%
plot_contrast(bins = 20)
nrow(filter(df_merged_german, num_homophones == 1))
## [1] 1042
model_full = lm(data = filter(df_merged_german, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_merged_german, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -255.460 -1.288 2.997 6.323 16.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5286 3.7211 -0.411 0.6813
## entropy -5.9785 2.7656 -2.162 0.0309 *
## frequency -4.5311 0.6498 -6.973 5.50e-12 ***
## num_sylls_est -0.5602 0.7199 -0.778 0.4367
## normalized_surprisal 7.9448 1.1008 7.217 1.03e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.05 on 1037 degrees of freedom
## Multiple R-squared: 0.1082, Adjusted R-squared: 0.1048
## F-statistic: 31.47 on 4 and 1037 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_german, num_homophones ==1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1038 339256
## 2 1037 337734 1 1522 4.6732 0.03086 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
df_french = read_data("../data/processed/french/reals/french_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `1_ortho` = col_character(),
## `2_phon` = col_character(),
## `3_lemme` = col_character(),
## `4_cgram` = col_character(),
## `5_genre` = col_character(),
## `6_nombre` = col_character(),
## `11_infover` = col_character(),
## `17_cvcv` = col_character(),
## `18_p_cvcv` = col_character(),
## `23_syll` = col_character(),
## `25_cv-cv` = col_character(),
## `26_orthrenv` = col_character(),
## `27_phonrenv` = col_character(),
## `28_orthosyll` = col_character(),
## `29_cgramortho` = col_character(),
## `34_morphoder` = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_french)
## [1] 37278
## Now normalize probabilities and compute expected homophones per wordform
df_french = df_french %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_french %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 9 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 3893 3893
## 2 2 13794 13794
## 3 3 15258 15258
## 4 4 7589 7589
## 5 5 2466 2466
## 6 6 670 670
## 7 7 97 97
## 8 8 12 12
## 9 9 3 3
Read in French lemmas.
df_french_lemmas = read_csv("../data/processed/french/reals/french_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `1_ortho` = col_character(),
## `2_phon` = col_character(),
## `3_lemme` = col_character(),
## `4_cgram` = col_character(),
## `5_genre` = col_character(),
## `6_nombre` = col_character(),
## `11_infover` = col_character(),
## `17_cvcv` = col_character(),
## `18_p_cvcv` = col_character(),
## `23_syll` = col_character(),
## `25_cv-cv` = col_character(),
## `26_orthrenv` = col_character(),
## `27_phonrenv` = col_character(),
## `28_orthosyll` = col_character(),
## `29_cgramortho` = col_character(),
## `34_morphoder` = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_french_lemmas)
## [1] 43782
sum(df_french_lemmas$`8_freqlemlivres`)
## [1] 877113.8
df_french_lemmas = df_french_lemmas %>%
## Multiply by 14.8, b/c measures were divided by 14.8 in Lexique
mutate(freq_adjusted = `8_freqlemlivres` * 14.8) %>%
group_by(`2_phon`) %>%
summarise(total_frequency = sum(freq_adjusted))
nrow(df_french_lemmas)
## [1] 37278
sum(df_french_lemmas$total_frequency)
## [1] 12981285
df_french_merged = df_french %>%
inner_join(df_french_lemmas, by = "2_phon")
nrow(df_french)
## [1] 37278
nrow(df_french_merged)
## [1] 37278
df_french_merged$frequency = log10(df_french_merged$total_frequency + 1)
And finally, we can run the regression and visualize model coefficients:
output = df_french_merged %>%
run_regression()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.53 0.277 -20.0 2.41e- 88
## 2 frequency -0.446 0.0524 -8.51 1.82e- 17
## 3 num_sylls_est 0.563 0.0481 11.7 1.55e- 31
## 4 normalized_surprisal 3.18 0.109 29.1 1.47e-184
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37274 2687620. 1 5221. 72.4 1.82e-17
Directly contrast the relationships:
df_french_merged %>%
plot_contrast(bins = 20)
df_french_lemmas = read_csv("../data/processed/french/reals/french_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `1_ortho` = col_character(),
## `2_phon` = col_character(),
## `3_lemme` = col_character(),
## `4_cgram` = col_character(),
## `5_genre` = col_character(),
## `6_nombre` = col_character(),
## `11_infover` = col_character(),
## `17_cvcv` = col_character(),
## `18_p_cvcv` = col_character(),
## `23_syll` = col_character(),
## `25_cv-cv` = col_character(),
## `26_orthrenv` = col_character(),
## `27_phonrenv` = col_character(),
## `28_orthosyll` = col_character(),
## `29_cgramortho` = col_character(),
## `34_morphoder` = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
df_french_lemmas$freq_adjusted = df_french_lemmas$`8_freqlemlivres` * 14.8 + 1
nrow(df_french_lemmas)
## [1] 43782
df_french_lemmas = df_french_lemmas %>%
  group_by(`2_phon`) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_french_lemmas %>%
group_by(`2_phon`) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
df_french_entropy = df_french_merged %>%
inner_join(df_entropy, by = "2_phon")
nrow(df_french_entropy)
## [1] 37278
nrow(df_french_merged)
## [1] 37278
nrow(filter(df_french_entropy, num_homophones == 1))
## [1] 4540
model_full = lm(data = filter(df_french_entropy, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_french_entropy, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -525.90 -0.21 1.14 2.08 5.54
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.74449 1.24840 -4.601 4.31e-06 ***
## entropy 0.09027 0.98403 0.092 0.926910
## frequency -0.81971 0.22840 -3.589 0.000336 ***
## num_sylls_est 0.83454 0.21022 3.970 7.30e-05 ***
## normalized_surprisal 3.69626 0.39147 9.442 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.13 on 4535 degrees of freedom
## Multiple R-squared: 0.02544, Adjusted R-squared: 0.02458
## F-statistic: 29.6 on 4 and 4535 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_french_entropy, num_homophones ==1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4536 562031
## 2 4535 562030 1 1.043 0.0084 0.9269
df_japanese = read_data("../data/processed/japanese/reals/japanese_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## orth_form_kanji = col_character(),
## orth_form_hiragana = col_character(),
## orth_form_romaji = col_character(),
## phonetic_form = col_character(),
## morph_form = col_character(),
## glosses = col_character(),
## multiple_pronunications = col_logical(),
## phonetic_remapped = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_japanese)
## [1] 40449
## Now normalize probabilities and compute expected homophones per wordform
df_japanese = df_japanese %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_japanese %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 17 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 1011 1011
## 2 2 7710 7710
## 3 3 14816 14816
## 4 4 14237 14237
## 5 5 6619 6619
## 6 6 3673 3673
## 7 7 1657 1657
## 8 8 733 733
## 9 9 345 345
## 10 10 185 185
## 11 11 72 72
## 12 12 52 52
## 13 13 19 19
## 14 14 8 8
## 15 15 7 7
## 16 16 2 2
## 17 22 1 1
Japanese already has frequency data.
df_japanese_lemmas = read_csv("../data/processed/japanese/reals/japanese_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## orth_form_kanji = col_character(),
## orth_form_hiragana = col_character(),
## orth_form_romaji = col_character(),
## phonetic_form = col_character(),
## morph_form = col_character(),
## frequency = col_double(),
## glosses = col_character(),
## multiple_pronunications = col_logical(),
## phonetic_remapped = col_character(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double()
## )
nrow(df_japanese_lemmas)
## [1] 51147
df_japanese_lemmas = df_japanese_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(total_frequency = mean(frequency))
nrow(df_japanese_lemmas)
## [1] 40449
df_japanese_merged = df_japanese %>%
inner_join(df_japanese_lemmas, by = "phonetic_remapped")
nrow(df_japanese)
## [1] 40449
nrow(df_japanese_merged)
## [1] 40449
df_japanese_merged$frequency = log10(df_japanese_merged$total_frequency + 1)
And finally, we can run the regression and visualize model coefficients:
output = df_japanese_merged %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.83 0.144 -26.5 7.36e-154
## 2 frequency -1.73 0.106 -16.4 5.95e- 60
## 3 num_sylls_est 0.129 0.0183 7.05 1.85e- 12
## 4 normalized_surprisal 2.92 0.0851 34.3 1.91e-254
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40445 1206484. 1 7982. 268. 5.95e-60
Directly contrast the relationships:
df_japanese_merged %>%
plot_contrast(bins = 20)
df_mandarin = read_data("../data/processed/mandarin/reals/mandarin_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Key = col_character(),
## `PY+T` = col_character(),
## `Pho+T` = col_character(),
## `IPA+T` = col_character(),
## `PY-T` = col_character(),
## `Pho-T` = col_character(),
## `IPA-T` = col_character(),
## Tone = col_character(),
## SyStruct = col_character(),
## Words = col_character(),
## `Dominate-POS` = col_character(),
## `Other-POSes` = col_character(),
## Neighbors = col_character(),
## file = col_character(),
## homophonous_words = col_character(),
## word = col_character(),
## glides_remapped = col_character(),
## phonetic_remapped = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_mandarin)
## [1] 45871
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin = df_mandarin %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
nrow(df_mandarin)
## [1] 45871
## Double-check that this equals correct amount in lexicon
df_mandarin %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 12 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 4658 4658
## 2 2 32372 32372
## 3 3 9186 9186
## 4 4 5336 5336
## 5 5 312 312
## 6 6 97 97
## 7 7 56 56
## 8 8 17 17
## 9 9 3 3
## 10 10 6 6
## 11 11 3 3
## 12 14 1 1
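The check above is consistent with one plausible baseline model (an assumption on our part, since `compute_expected_homophones` is defined elsewhere): within each length class, each of the \(M\) meanings is assigned to wordform \(w_i\) with probability \(p_i\), so \(w_i\) carries \(M p_i\) meanings in expectation, i.e., \(k_i = M p_i - 1\) expected homophones beyond the wordform itself. Summing \(k_i + 1\) over wordforms then recovers \(M\). A minimal base-R sketch:

```r
# Toy illustration (assumed model, not the actual compute_expected_homophones):
# M meanings assigned to wordforms in proportion to normalized probability p
p = c(0.5, 0.3, 0.2)  # toy normalized phonotactic probabilities (sum to 1)
M = 10                # toy number of meanings in this length class
k = M * p - 1         # expected homophones beyond the wordform itself
sum(k + 1)            # recovers M, mirroring the check above
```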
Mandarin already has frequency data, but we need to make sure it's summed across lemmas.
df_mandarin_lemmas = read_csv("../data/processed/mandarin/reals/mandarin_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Key = col_character(),
## `PY+T` = col_character(),
## `Pho+T` = col_character(),
## `IPA+T` = col_character(),
## `PY-T` = col_character(),
## `Pho-T` = col_character(),
## `IPA-T` = col_character(),
## Tone = col_character(),
## SyStruct = col_character(),
## Words = col_character(),
## `Dominate-POS` = col_character(),
## `Other-POSes` = col_character(),
## Neighbors = col_character(),
## file = col_character(),
## homophonous_words = col_character(),
## word = col_character(),
## glides_remapped = col_character(),
## phonetic_remapped = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
df_mandarin_lemmas = df_mandarin_lemmas %>%
mutate(frequency = FreqPM * 1000000) %>%
group_by(phonetic_remapped) %>%
summarise(total_frequency = sum(frequency))
nrow(df_mandarin_lemmas)
## [1] 45871
df_mandarin_merged = df_mandarin %>%
inner_join(df_mandarin_lemmas, by = "phonetic_remapped")
nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 45871
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency + 1)
Now we can visualize the outcome in a variety of ways:
df_mandarin_merged %>%
plot_comparison()
df_mandarin_merged %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_mandarin_merged %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.77 0.325 -11.6 4.16e-31
## 2 frequency -0.284 0.0548 -5.19 2.10e- 7
## 3 num_sylls_est 0.0996 0.0625 1.59 1.11e- 1
## 4 normalized_surprisal 4.37 0.256 17.1 2.67e-65
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 45867 4453802. 1 2617. 26.9 0.000000210
Directly contrast the relationships:
df_mandarin_merged %>%
plot_contrast(bins = 20)
Here, we calculate the Homophony Delta for Mandarin Chinese, using the Chinese Lexical Database.
df_mandarin_cld = read_data("../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## C1 = col_character(),
## C2 = col_character(),
## C3 = col_logical(),
## C4 = col_logical(),
## C1Structure = col_character(),
## C2Structure = col_character(),
## C3Structure = col_logical(),
## C4Structure = col_logical(),
## C1Type = col_character(),
## C2Type = col_character(),
## C3Type = col_logical(),
## C4Type = col_logical(),
## Pinyin = col_character(),
## C1Pinyin = col_character(),
## C2Pinyin = col_character(),
## C3Pinyin = col_logical(),
## C4Pinyin = col_logical(),
## IPA = col_character(),
## C1IPA = col_character()
## # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 712019 parsing failures.
## row col expected actual file
## 30786 C3 1/0/T/F/TRUE/FALSE 么 '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Structure 1/0/T/F/TRUE/FALSE SG '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Type 1/0/T/F/TRUE/FALSE Other '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Pinyin 1/0/T/F/TRUE/FALSE me5 '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3IPA 1/0/T/F/TRUE/FALSE mɤ5 '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## ..... ........... .................. ...... ...............................................................................
## See problems(...) for more details.
nrow(df_mandarin_cld)
## [1] 41009
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin_cld = df_mandarin_cld %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
nrow(df_mandarin_cld)
## [1] 41009
## Double-check that this equals the correct total in the lexicon
df_mandarin_cld %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 4 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 3648 3648
## 2 2 31637 31637
## 3 3 6930 6930
## 4 4 3337 3337
Mandarin already has frequency data, but we need to make sure it's summed across lemmas.
df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## C1 = col_character(),
## C2 = col_character(),
## C3 = col_logical(),
## C4 = col_logical(),
## C1Structure = col_character(),
## C2Structure = col_character(),
## C3Structure = col_logical(),
## C4Structure = col_logical(),
## C1Type = col_character(),
## C2Type = col_character(),
## C3Type = col_logical(),
## C4Type = col_logical(),
## Pinyin = col_character(),
## C1Pinyin = col_character(),
## C2Pinyin = col_character(),
## C3Pinyin = col_logical(),
## C4Pinyin = col_logical(),
## IPA = col_character(),
## C1IPA = col_character()
## # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
## row col expected actual file
## 35286 C3 1/0/T/F/TRUE/FALSE 么 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE SG '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type 1/0/T/F/TRUE/FALSE Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin 1/0/T/F/TRUE/FALSE me5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA 1/0/T/F/TRUE/FALSE mɤ5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(total_frequency = sum(FrequencyRaw))
nrow(df_mandarin_lemmas)
## [1] 41009
df_mandarin_merged = df_mandarin_cld %>%
inner_join(df_mandarin_lemmas, by = "phonetic_remapped")
nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 41009
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency)
Now we can visualize the outcome in a variety of ways:
df_mandarin_merged %>%
plot_comparison()
df_mandarin_merged %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_mandarin_merged %>%
run_regression()
output %>%
plot_outcome()
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.85 0.0987 -18.7 5.10e-78
## 2 frequency -0.275 0.0156 -17.6 8.29e-69
## 3 num_sylls_est 0.00764 0.0221 0.345 7.30e- 1
## 4 normalized_surprisal 2.71 0.0683 39.7 0
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 41005 303218. 1 2281. 309. 8.29e-69
Directly contrast the relationships:
df_mandarin_merged %>%
plot_contrast(bins = 20)
df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## C1 = col_character(),
## C2 = col_character(),
## C3 = col_logical(),
## C4 = col_logical(),
## C1Structure = col_character(),
## C2Structure = col_character(),
## C3Structure = col_logical(),
## C4Structure = col_logical(),
## C1Type = col_character(),
## C2Type = col_character(),
## C3Type = col_logical(),
## C4Type = col_logical(),
## Pinyin = col_character(),
## C1Pinyin = col_character(),
## C2Pinyin = col_character(),
## C3Pinyin = col_logical(),
## C4Pinyin = col_logical(),
## IPA = col_character(),
## C1IPA = col_character()
## # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
## row col expected actual file
## 35286 C3 1/0/T/F/TRUE/FALSE 么 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE SG '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type 1/0/T/F/TRUE/FALSE Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin 1/0/T/F/TRUE/FALSE me5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA 1/0/T/F/TRUE/FALSE mɤ5 '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
mutate(freq_adjusted = FrequencyRaw + 1) %>%
# mutate(freq_adjusted = log10(frequency + 1) + .01) %>%
group_by(phonetic_remapped) %>%
mutate(total_frequency = sum(freq_adjusted),
relative_frequency = freq_adjusted / total_frequency)
df_entropy = df_mandarin_lemmas %>%
group_by(phonetic_remapped) %>%
summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
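The `entropy` column here is the Shannon entropy (in nats) of each wordform's relative lemma frequencies: it is near zero when one lemma dominates the wordform's frequency mass, and grows as the mass spreads evenly across lemmas. A minimal base-R check of the formula used above:

```r
# Shannon entropy (nats) of a vector of relative frequencies
shannon_entropy = function(rel_freq) {
  -sum(rel_freq * log(rel_freq))
}
shannon_entropy(c(0.5, 0.5))    # two equally frequent lemmas: log(2)
shannon_entropy(c(0.99, 0.01))  # one dominant lemma: close to 0
```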
df_mandarin_entropy = df_mandarin_merged %>%
inner_join(df_entropy, by = "phonetic_remapped")
nrow(df_mandarin_entropy)
## [1] 41009
nrow(filter(df_mandarin_entropy, num_homophones == 1))
## [1] 1871
model_full = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)
summary(model_full)
##
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal,
## data = filter(df_mandarin_entropy, num_homophones == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.710 -0.380 0.886 1.827 4.344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7619 0.9007 -3.067 0.00220 **
## entropy -1.3703 0.4268 -3.211 0.00135 **
## frequency -0.7130 0.1257 -5.670 1.65e-08 ***
## num_sylls_est 0.5328 0.2405 2.216 0.02683 *
## normalized_surprisal 4.4816 0.4227 10.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.655 on 1866 degrees of freedom
## Multiple R-squared: 0.07813, Adjusted R-squared: 0.07615
## F-statistic: 39.54 on 4 and 1866 DF, p-value: < 2.2e-16
model_reduced = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
delta ~ frequency + num_sylls_est + normalized_surprisal)
anova(model_reduced, model_full)
## Analysis of Variance Table
##
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1867 25067
## 2 1866 24929 1 137.71 10.308 0.001347 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Below, we combine the lexica so that we can visualize the central relationships in the same plot.
prep_lexicon = function(df, lang_name) {
# Bin frequency and neighborhood size into ventiles, tag the language,
# and keep only the columns shared across lexica
df %>%
mutate(binned_frequency = ntile(frequency, n = 20),
binned_neighorhood_size = ntile(neighborhood_size, n = 20),
language = lang_name) %>%
select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
num_homophones, k, frequency, binned_frequency, delta, language)
}
df_merged_english_f = prep_lexicon(df_merged_english, 'English')
df_merged_dutch_f = prep_lexicon(df_merged_dutch, 'Dutch')
df_merged_german_f = prep_lexicon(df_merged_german, 'German')
df_french_f = prep_lexicon(df_french_merged, 'French')
df_japanese_f = prep_lexicon(df_japanese_merged, 'Japanese')
df_mandarin_cld_f = prep_lexicon(df_mandarin_merged, 'Mandarin')
df_all_lexica = df_merged_english_f %>%
rbind(df_merged_dutch_f) %>%
rbind(df_merged_german_f) %>%
rbind(df_french_f) %>%
rbind(df_japanese_f) %>%
rbind(df_mandarin_cld_f)
PlotTheme = theme(
axis.title.x = element_text(size = 16),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
axis.title.y = element_text(size = 16),
strip.text.x = element_text(size = 16),
title = element_text(size = 16),
legend.text = element_text(size = 16),
legend.title = element_text(size = 16))
df_all_lexica %>%
group_by(binned_frequency, language) %>%
summarise(mean_delta = mean(delta),
se_delta = sd(delta) / sqrt(n())) %>%
ggplot(aes(x = binned_frequency,
y = mean_delta)) +
geom_point(stat = "identity", size = .2) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_delta - 2 * se_delta,
ymax = mean_delta + 2 *se_delta),
width=.2,
position=position_dodge(.9)) +
labs(x = "Binned Frequency",
y = "Delta (Real - Expected)") +
geom_smooth() +
theme_bw() +
facet_wrap(~language) +
PlotTheme
## `summarise()` has grouped output by 'binned_frequency'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggsave("../Figures/delta.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
mutate(Baseline = k,
Real = num_homophones) %>%
pivot_longer(c(Baseline, Real),
names_to = "Lexicon",
values_to = "homophones") %>%
group_by(binned_frequency, language, Lexicon) %>%
summarise(mean_homophones = mean(homophones),
se_homophones = sd(homophones) / sqrt(n())) %>%
ggplot(aes(x = binned_frequency,
y = mean_homophones,
color = Lexicon)) +
geom_point(stat = "identity", size = .5, alpha = .5) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones,
ymax = mean_homophones + 2 * se_homophones),
width=.2) +
labs(x = "Binned Frequency",
y = "Number of Homophones") +
geom_smooth(alpha = .6) +
theme_bw() +
facet_wrap(~language) +
PlotTheme +
theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` has grouped output by 'binned_frequency', 'language'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggsave("../Figures/fig1_frequency.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
mutate(Baseline = k,
Real = num_homophones) %>%
pivot_longer(c(Baseline, Real),
names_to = "Lexicon",
values_to = "homophones") %>%
group_by(binned_neighorhood_size, language, Lexicon) %>%
summarise(mean_homophones = mean(homophones),
se_homophones = sd(homophones) / sqrt(n())) %>%
ggplot(aes(x = binned_neighorhood_size,
y = mean_homophones,
color = Lexicon)) +
geom_point(stat = "identity", size = .5, alpha = .5) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones,
ymax = mean_homophones + 2 * se_homophones),
width=.2) +
labs(x = "Binned Neighborhood Size",
y = "Number of Homophones") +
geom_smooth(alpha = .6) +
theme_bw() +
facet_wrap(~language) +
PlotTheme +
theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` has grouped output by 'binned_neighorhood_size', 'language'. You can override using the `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggsave("../Figures/fig3_neighbors.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
return_regression_table = function(df) {
# Build full model
model_full = lm(data = df,
delta ~ frequency + num_sylls_est + normalized_surprisal)
# Tidy model output
df_model = broom::tidy(model_full)
df_model
}
### Get coefficients for each language
english_coefficients = df_merged_english_f %>%
return_regression_table() %>%
mutate(language = 'English')
dutch_coefficients = df_merged_dutch_f %>%
return_regression_table() %>%
mutate(language = 'Dutch')
german_coefficients = df_merged_german_f %>%
return_regression_table() %>%
mutate(language = 'German')
french_coefficients = df_french_f %>%
return_regression_table() %>%
mutate(language = 'French')
japanese_coefficients = df_japanese_f %>%
return_regression_table() %>%
mutate(language = 'Japanese')
mandarin_coefficients = df_mandarin_cld_f %>%
return_regression_table() %>%
mutate(language = 'Mandarin')
# Combine into single dataframe
df_all_coefficients = english_coefficients %>%
rbind(dutch_coefficients) %>%
rbind(german_coefficients) %>%
rbind(french_coefficients) %>%
rbind(japanese_coefficients) %>%
rbind(mandarin_coefficients)
df_all_coefficients$predictor = fct_recode(
df_all_coefficients$term,
"Number of Syllables" = "num_sylls_est",
"Normalized Surprisal" = "normalized_surprisal",
"Log(Frequency)" = "frequency"
)
### Plots outcome of regression
df_all_coefficients %>%
filter(predictor != "(Intercept)") %>%
ggplot(aes(x = language,
y = estimate)) +
geom_point() +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = estimate - 2*std.error,
ymax = estimate + 2*std.error),
width=.2,
position=position_dodge(.9)) +
labs(x = "Predictor",
y = "Estimate") +
theme_bw() +
facet_wrap(~predictor) +
PlotTheme
### Save to figure
ggsave("../Figures/fig2_coefficients.png", dpi = 300)
## Saving 7 x 5 in image
Across all languages tested, Frequency exhibits a negative relationship with Homophony Delta: more frequent wordforms have fewer homophones than one would expect purely on the basis of their phonotactics.
But there are other predictions, again borne out by previous research (Trott & Bergen, 2020), about which areas of the lexicon should see the greatest increase or reduction in homophony. One example is Neighborhood Size: wordforms with more neighbors might show a corresponding decrease in homophony for one of two reasons. These explanations are not mutually exclusive, though they suppose distinct causal pathways for the reduction in homophony:
First, if larger neighborhoods already increase the possibility of perceptual confusion (e.g., "bat" vs. "pat"), then wordform \(w_i\) might have fewer homophones than expected on account of its phonotactic probability \(p_i\) when \(w_i\) already has many neighbors.
Second, if languages select against over-saturation (Trott & Bergen, 2020), and if this selection process ultimately results in larger neighborhood sizes, then the areas of the lexicon that have undergone the most negative selection relative to their phonotactics should have correspondingly larger neighborhoods.
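Neighborhood size is computed upstream of this notebook; purely for illustration, here is a hedged base-R sketch of one common definition (the number of other forms within Levenshtein distance 1; the actual pipeline may define neighbors over phonemic transcriptions instead):

```r
# Hypothetical neighbor count: other forms exactly one edit away
# (substitution, insertion, or deletion), via base R's adist()
count_neighbors = function(forms) {
  d = adist(forms)   # pairwise Levenshtein distance matrix
  rowSums(d == 1)    # for each form, how many others are 1 edit away
}
count_neighbors(c("bat", "pat", "bit", "dog"))  # 2 1 1 0
```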
run_regression_with_neighborhood_size = function(df) {
# initialize output
out = list()
# Log neighborhood
df$log_neighborhood = log10(df$neighborhood_size + 1)
# Build full model
model_full = lm(data = df,
delta ~ frequency + num_sylls_est + normalized_surprisal + log_neighborhood)
# Build reduced model
model_reduced = lm(data = df,
delta ~ num_sylls_est + normalized_surprisal + frequency)
# Run model comparison
comparison = anova(model_reduced, model_full)
df_comp = broom::tidy(comparison) %>%
na.omit()
# Tidy model output
df_model = broom::tidy(model_full)
# Add to list parameters
out$model = df_model
out$comparison = df_comp
out
}
plot_neighborhood_bins_residualized = function(df, n) {
### Plot residuals of delta ~ #sylls + surprisal + frequency against neighborhood size
# Build reduced model
model_reduced = lm(data = df,
delta ~ num_sylls_est + normalized_surprisal + frequency)
df$resid = residuals(model_reduced)
# Plots delta ~ frequency bins.
df$neighborhood_binned = as.numeric(ntile(df$neighborhood_size, n = n))
g = df %>%
group_by(neighborhood_binned) %>%
summarise(mean_delta = mean(resid),
se_delta = sd(resid) / sqrt(n())) %>%
ggplot(aes(x = neighborhood_binned,
y = mean_delta)) +
geom_point(stat = "identity", size = .2) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = mean_delta - se_delta,
ymax = mean_delta + se_delta),
width=.2,
position=position_dodge(.9)) +
labs(x = "Binned Neighborhood Size",
y = "Residuals (Reduced Model)") +
geom_smooth() +
theme_minimal()
g
}
plot_outcome = function(df_output) {
df_output$model$predictor = fct_recode(
df_output$model$term,
"Number of Syllables" = "num_sylls_est",
"Normalized Surprisal" = "normalized_surprisal",
"Log(Frequency)" = "frequency",
"Log(Neighborhood Size)" = "log_neighborhood"
)
### Plots outcome of regression
g = df_output$model %>%
ggplot(aes(x = predictor,
y = estimate)) +
geom_point() +
coord_flip() +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_errorbar(aes(ymin = estimate - 2*std.error,
ymax = estimate + 2*std.error),
width=.2,
position=position_dodge(.9)) +
labs(x = "Predictor",
y = "Estimate") +
theme_minimal()
g
}
df_merged_english %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_merged_english %>%
plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_merged_english %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5.16 0.394 -13.1 5.27e- 39
## 2 frequency -0.363 0.0666 -5.45 5.08e- 8
## 3 num_sylls_est 0.326 0.0861 3.79 1.52e- 4
## 4 normalized_surprisal 3.55 0.131 27.1 2.47e-159
## 5 log_neighborhood -1.59 0.182 -8.76 2.08e- 18
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28909 2735378. 1 7258. 76.7 2.08e-18
df_merged_dutch %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_merged_dutch %>%
plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_merged_dutch %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.52 0.399 -11.3 8.77e- 30
## 2 frequency -1.74 0.0727 -23.9 2.41e-125
## 3 num_sylls_est 0.264 0.0769 3.43 6.07e- 4
## 4 normalized_surprisal 4.07 0.165 24.6 6.31e-133
## 5 log_neighborhood -1.10 0.229 -4.79 1.68e- 6
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63359 16260096. 1 5887. 22.9 0.00000168
df_merged_german %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_merged_german %>%
plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_merged_german %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.57 0.456 -7.83 4.92e-15
## 2 frequency -1.17 0.100 -11.7 2.18e-31
## 3 num_sylls_est 0.145 0.0809 1.79 7.31e- 2
## 4 normalized_surprisal 3.35 0.183 18.4 5.40e-75
## 5 log_neighborhood -1.85 0.292 -6.34 2.32e-10
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 46715 11451881. 1 9852. 40.2 2.32e-10
df_french_merged %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_french_merged %>%
plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_french_merged %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.49 0.301 -8.25 1.59e- 16
## 2 frequency -0.153 0.0533 -2.87 4.09e- 3
## 3 num_sylls_est -0.132 0.0556 -2.37 1.76e- 2
## 4 normalized_surprisal 3.07 0.108 28.3 1.07e-174
## 5 log_neighborhood -3.49 0.143 -24.4 1.27e-130
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37273 2645315. 1 42305. 596. 1.27e-130
df_japanese_merged %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_japanese_merged %>%
plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_japanese_merged %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.236 0.177 -1.33 1.84e- 1
## 2 frequency -1.34 0.105 -12.8 3.49e- 37
## 3 num_sylls_est -0.398 0.0238 -16.7 1.18e- 62
## 4 normalized_surprisal 2.68 0.0842 31.8 1.92e-219
## 5 log_neighborhood -2.50 0.0737 -34.0 3.37e-249
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40444 1173044. 1 33439. 1153. 3.37e-249
df_mandarin_merged %>%
plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_mandarin_merged %>%
plot_neighborhood_bins_residualized(n = 20)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
output = df_mandarin_merged %>%
run_regression_with_neighborhood_size()
output %>%
plot_outcome()
output$model
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.284 0.109 2.61 9.18e- 3
## 2 frequency -0.178 0.0155 -11.5 1.88e- 30
## 3 num_sylls_est -0.595 0.0259 -23.0 3.83e-116
## 4 normalized_surprisal 2.42 0.0673 35.9 1.17e-278
## 5 log_neighborhood -2.10 0.0496 -42.4 0
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 41004 290500. 1 12718. 1795. 0
The three predictors (frequency, #syllables, and surprisal) are all correlated. The correlation between the latter two is addressed by using Normalized Surprisal instead of Surprisal. But this leaves the correlations between frequency and #syllables and between frequency and surprisal.
Collinearity between predictors can sometimes lead to unstable coefficient estimates (even while leaving overall model fit unaffected). There are several ways researchers try to account for potential collinearity. As a first step, one can calculate the variance inflation factor (VIF) for each predictor; a sufficiently high VIF suggests that collinearity may be a problem.
Below, we calculate the VIF for the main model in each language. The VIF scores from the complete model are all low, with a maximum of roughly 1.48 (for the number of syllables in English). This is reassuring.
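For intuition about what `car::vif` computes below: the VIF for predictor \(j\) is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. A minimal base-R sketch on toy data (unrelated to the lexicon data above):

```r
set.seed(1)
x1 = rnorm(200)
x2 = x1 + rnorm(200, sd = 0.5)  # strongly collinear with x1
x3 = rnorm(200)                 # roughly independent of x1 and x2
r2_x2 = summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 = 1 / (1 - r2_x2)  # well above 1, flagging collinearity
r2_x3 = summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 = 1 / (1 - r2_x3)  # near 1, as expected for an independent predictor
```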
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
m = lm(data = df_merged_english,
delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
## frequency num_sylls_est normalized_surprisal
## 1.127951 1.482446 1.337387
max(vif(m))
## [1] 1.482446
m = lm(data = df_merged_german,
delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
## frequency num_sylls_est normalized_surprisal
## 1.097079 1.309988 1.240748
max(vif(m))
## [1] 1.309988
m = lm(data = df_merged_dutch,
delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
## frequency num_sylls_est normalized_surprisal
## 1.057189 1.316050 1.253957
max(vif(m))
## [1] 1.31605
m = lm(data = df_japanese_merged,
delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
## frequency num_sylls_est normalized_surprisal
## 1.039111 1.125562 1.098075
max(vif(m))
## [1] 1.125562
m = lm(data = df_french_merged,
delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
## frequency num_sylls_est normalized_surprisal
## 1.112989 1.478553 1.348290
max(vif(m))
## [1] 1.478553
m = lm(data = df_mandarin_merged,
delta ~ frequency + num_sylls_est + normalized_surprisal)
vif(m)
## frequency num_sylls_est normalized_surprisal
## 1.131245 1.162525 1.029433
max(vif(m))
## [1] 1.162525
Here, we check whether the original finding (i.e., that more frequent words are more ambiguous) indeed holds across languages.
m = glm(data = df_merged_english,
num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
family = poisson())
summary(m)
##
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
## family = poisson(), data = df_merged_english)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4621 -0.5686 -0.3986 -0.2530 4.3877
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.29534 0.07295 -17.757 <2e-16 ***
## frequency 0.67082 0.01217 55.126 <2e-16 ***
## num_sylls_est -0.70260 0.01972 -35.632 <2e-16 ***
## normalized_surprisal -0.02284 0.02419 -0.944 0.345
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 23170 on 28913 degrees of freedom
## Residual deviance: 15780 on 28910 degrees of freedom
## AIC: 26494
##
## Number of Fisher Scoring iterations: 6
m = glm(data = df_merged_german,
num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
family = poisson())
summary(m)
##
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
## family = poisson(), data = df_merged_german)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1118 -0.2156 -0.1563 -0.1212 5.2721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.35319 0.16488 -20.337 <2e-16 ***
## frequency 0.88212 0.02730 32.308 <2e-16 ***
## num_sylls_est -0.53413 0.03763 -14.196 <2e-16 ***
## normalized_surprisal 0.06701 0.05735 1.168 0.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 9008.1 on 46719 degrees of freedom
## Residual deviance: 7126.7 on 46716 degrees of freedom
## AIC: 9402.1
##
## Number of Fisher Scoring iterations: 7
m = glm(data = df_merged_dutch,
num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
family = poisson())
summary(m)
##
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
## family = poisson(), data = df_merged_dutch)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7835 -0.2248 -0.1559 -0.1168 4.8731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.10770 0.11126 -18.945 < 2e-16 ***
## frequency 0.63707 0.01660 38.384 < 2e-16 ***
## num_sylls_est -0.99196 0.03195 -31.047 < 2e-16 ***
## normalized_surprisal 0.12656 0.03544 3.571 0.000355 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 15734 on 63363 degrees of freedom
## Residual deviance: 11000 on 63360 degrees of freedom
## AIC: 14952
##
## Number of Fisher Scoring iterations: 7
m = glm(data = df_japanese_merged,
num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
family = poisson())
summary(m)
##
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
## family = poisson(), data = df_japanese_merged)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0122 -0.7595 -0.4818 -0.1831 8.8324
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.62451 0.05695 46.08 <2e-16 ***
## frequency 0.56094 0.01549 36.22 <2e-16 ***
## num_sylls_est -1.02396 0.01248 -82.04 <2e-16 ***
## normalized_surprisal -0.51095 0.02442 -20.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 46530 on 40448 degrees of freedom
## Residual deviance: 32937 on 40445 degrees of freedom
## AIC: 47052
##
## Number of Fisher Scoring iterations: 6
m = glm(data = df_french_merged,
num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
family = poisson())
summary(m)
##
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
## family = poisson(), data = df_french_merged)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4353 -0.5689 -0.4168 -0.3199 5.5647
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.44129 0.07151 -34.141 <2e-16 ***
## frequency 0.75907 0.01220 62.200 <2e-16 ***
## num_sylls_est -0.19130 0.01511 -12.661 <2e-16 ***
## normalized_surprisal 0.02898 0.02442 1.186 0.235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 26447 on 37277 degrees of freedom
## Residual deviance: 21159 on 37274 degrees of freedom
## AIC: 32466
##
## Number of Fisher Scoring iterations: 6
m = glm(data = df_mandarin_merged,
num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
family = poisson())
summary(m)
##
## Call:
## glm(formula = num_homophones ~ frequency + num_sylls_est + normalized_surprisal,
## family = poisson(), data = df_mandarin_merged)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1664 -0.3767 -0.2930 -0.0850 7.2235
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.40941 0.12936 18.62 <2e-16 ***
## frequency 0.54218 0.01597 33.96 <2e-16 ***
## num_sylls_est -2.71580 0.04033 -67.34 <2e-16 ***
## normalized_surprisal -1.08918 0.06335 -17.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 27993 on 41008 degrees of freedom
## Residual deviance: 13684 on 41005 degrees of freedom
## AIC: 19636
##
## Number of Fisher Scoring iterations: 6
First, we load and process the data.
df_english = read_data("../data/processed/english/reals/english_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## index = col_double(),
## Word = col_character(),
## CobLog = col_double(),
## CompCnt = col_double(),
## PhonDISC = col_character(),
## Class = col_character(),
## SylCnt = col_double(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double(),
## rank_num_homophones = col_double(),
## neighborhood_size = col_double(),
## heldout_log_prob = col_double(),
## heldout_surprisal = col_double()
## )
nrow(df_english)
## [1] 35107
## Now normalize probabilities and compute expected homophones per wordform
df_english = df_english %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_english %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 9 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 7706 7706
## 2 2 15247 15247
## 3 3 11379 11379
## 4 4 5316 5316
## 5 5 1694 1694
## 6 6 439 439
## 7 7 95 95
## 8 8 10 10
## 9 10 1 1
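`compute_expected_homophones()` is defined elsewhere in the project. One way to derive the expectation, assuming each of the \(M\) meanings is assigned independently and wordform \(i\) receives a meaning with normalized probability \(p_i\), is that the number of additional meanings \(k_i\) landing on wordform \(i\) follows a Binomial\((M - 1, p_i)\) distribution, so \(E[k_i] = (M - 1)\,p_i\). A toy sketch under that assumption (function and values here are illustrative, not the actual implementation):

```r
# Toy sketch: expected extra meanings (homophones) per wordform, assuming
# each of the remaining M - 1 meanings lands on wordform i with prob. p_i.
expected_homophones <- function(p, M) (M - 1) * p

p <- c(0.4, 0.3, 0.2, 0.1)  # normalized wordform probabilities
M <- 100                    # meanings to distribute
expected_homophones(p, M)
# Higher-probability (more phonotactically plausible) wordforms are
# expected to accumulate proportionally more homophones.
```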
We then merge with frequency data.
df_subtlex = read_csv("../data/frequency/english/SUBTLEXusfrequencyabove1.csv")
## Parsed with column specification:
## cols(
## Word = col_character(),
## FREQcount = col_double(),
## CDcount = col_double(),
## FREQlow = col_double(),
## Cdlow = col_double(),
## SUBTLWF = col_double(),
## Lg10WF = col_double(),
## SUBTLCD = col_double(),
## Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 60384
df_subtlex$Word = tolower(df_subtlex$Word)
df_merged_english = df_english %>%
inner_join(df_subtlex, by = "Word")
nrow(df_merged_english)
## [1] 24307
df_merged_english$frequency = df_merged_english$Lg10WF
Now we can visualize the outcome in a variety of ways:
df_merged_english %>%
plot_comparison()
df_merged_english %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_english %>%
run_regression()
output %>%
plot_outcome()
## Warning: Unknown levels in `f`: log_neighborhood
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.59 0.425 -15.5 4.55e- 54
## 2 frequency -0.655 0.0837 -7.82 5.29e- 15
## 3 num_sylls_est 0.674 0.0791 8.52 1.71e- 17
## 4 normalized_surprisal 3.92 0.154 25.4 1.95e-140
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24303 2731590. 1 6882. 61.2 5.29e-15
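`run_regression()` is defined elsewhere in the project; judging from the coefficient table and the one-degree-of-freedom `comparison` output above, it plausibly fits the full delta model and tests the frequency term via a nested-model anova. A minimal sketch under that assumption, with synthetic data (variable effects here are invented for illustration):

```r
# Sketch of a nested-model comparison for the frequency term
# (assumed form of run_regression(); synthetic data for illustration).
set.seed(42)
n <- 1000
frequency <- rnorm(n)
num_sylls_est <- rpois(n, 2) + 1
normalized_surprisal <- rnorm(n)
delta <- -0.5 * frequency + 0.3 * num_sylls_est +
  2 * normalized_surprisal + rnorm(n)
df <- data.frame(delta, frequency, num_sylls_est, normalized_surprisal)

full    <- lm(delta ~ frequency + num_sylls_est + normalized_surprisal,
              data = df)
reduced <- lm(delta ~ num_sylls_est + normalized_surprisal, data = df)
anova(reduced, full)  # one-df F test of the frequency term
```

The negative frequency coefficient in the table above is the key result: frequent wordforms have fewer homophones than their length and phonotactics would predict.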
Directly contrast the relationships:
df_merged_english %>%
plot_contrast(bins = 20)
First, we load and process the data.
df_dutch = read_data("../data/processed/dutch/reals/dutch_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## Word = col_character(),
## PhonStrsDISC = col_character(),
## PhonCVBr = col_character(),
## PhonSylBCLX = col_character(),
## PhonStrsStDISC = col_character(),
## PhonStCVBr = col_character(),
## PhonSylStBCLX = col_character(),
## PhonolCLX = col_character(),
## PhonolCPA = col_character(),
## PhonDISC = col_character(),
## remove = col_logical()
## )
## See spec(...) for full column specifications.
nrow(df_dutch)
## [1] 65260
## Now normalize probabilities and compute expected homophones per wordform
df_dutch = df_dutch %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_dutch %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 9 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 4672 4672
## 2 2 20588 20588
## 3 3 26420 26420
## 4 4 11338 11338
## 5 5 3394 3394
## 6 6 846 846
## 7 7 189 189
## 8 8 28 28
## 9 9 2 2
We then merge with frequency data.
df_subtlex = read_delim("../data/frequency/dutch/SUBTLEX-NL.txt", delim="\t")
## Parsed with column specification:
## cols(
## Word = col_character(),
## FREQcount = col_double(),
## CDcount = col_double(),
## FREQlow = col_double(),
## CDlow = col_double(),
## FREQlemma = col_double(),
## SUBTLEXWF = col_double(),
## Lg10WF = col_double(),
## SUBTLEXCD = col_double(),
## Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 437503
df_subtlex$Word = tolower(df_subtlex$Word)
df_merged_dutch = df_dutch %>%
inner_join(df_subtlex, by = "Word")
nrow(df_merged_dutch)
## [1] 28330
df_merged_dutch$frequency = df_merged_dutch$Lg10WF
Now we can visualize the outcome in a variety of ways:
df_merged_dutch %>%
plot_comparison()
df_merged_dutch %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_dutch %>%
run_regression()
output %>%
plot_outcome()
## Warning: Unknown levels in `f`: log_neighborhood
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.64 0.834 -7.96 1.73e-15
## 2 frequency -3.38 0.170 -19.8 4.90e-87
## 3 num_sylls_est 0.289 0.164 1.77 7.75e- 2
## 4 normalized_surprisal 6.58 0.328 20.1 7.51e-89
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28326 15675501. 1 217909. 394. 4.90e-87
Directly contrast the relationships:
df_merged_dutch %>%
plot_contrast(bins = 20)
First, we load and process the data.
df_german = read_data("../data/processed/german/reals/german_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## `Unnamed: 0` = col_double(),
## index = col_double(),
## Word = col_character(),
## CompCnt = col_double(),
## PhonDISC = col_character(),
## SylCnt = col_double(),
## num_phones = col_double(),
## num_sylls_est = col_double(),
## remove = col_logical(),
## num_homophones = col_double(),
## log_prob = col_double(),
## surprisal = col_double(),
## rank_num_homophones = col_double(),
## neighborhood_size = col_double(),
## heldout_log_prob = col_double(),
## heldout_surprisal = col_double()
## )
nrow(df_german)
## [1] 50435
## Now normalize probabilities and compute expected homophones per wordform
df_german = df_german %>%
normalize_probabilities() %>%
compute_expected_homophones()
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_german %>%
group_by(num_sylls_est) %>%
summarise(total = sum(k + 1),
correct_total = mean(M))
## # A tibble: 10 × 3
## num_sylls_est total correct_total
## <dbl> <dbl> <dbl>
## 1 1 2454 2454
## 2 2 13181 13181
## 3 3 19429 19429
## 4 4 10983 10983
## 5 5 3971 3971
## 6 6 1240 1240
## 7 7 328 328
## 8 8 72 72
## 9 9 16 16
## 10 10 2 2
We then merge with frequency data.
df_freq = read_csv("../data/frequency/german/SUBTLEX-DE.csv")
## Parsed with column specification:
## cols(
## Word = col_character(),
## `Spell check` = col_double(),
## FREQcount = col_double()
## )
nrow(df_freq)
## [1] 377524
# df_freq$Word = tolower(df_freq$Word)
# log10 frequency (+ 1). Should be analogous to the Lg10WF measure used for English.
df_freq$Lg10WF = log10(df_freq$FREQcount + 1)
df_merged_german = df_german %>%
inner_join(df_freq, by = "Word")
nrow(df_merged_german)
## [1] 24350
df_merged_german$frequency = df_merged_german$Lg10WF
Now we can visualize the outcome in a variety of ways:
df_merged_german %>%
plot_comparison()
df_merged_german %>%
plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
And finally, we can run the regression and visualize model coefficients:
output = df_merged_german %>%
run_regression()
output %>%
plot_outcome()
## Warning: Unknown levels in `f`: log_neighborhood
output$model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.76 0.399 -12.0 7.99e- 33
## 2 frequency -1.67 0.0888 -18.8 3.12e- 78
## 3 num_sylls_est 0.316 0.0731 4.32 1.55e- 5
## 4 normalized_surprisal 4.19 0.157 26.7 2.36e-154
output$comparison
## # A tibble: 1 × 6
## res.df rss df sumsq statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24346 2750831. 1 39903. 353. 3.12e-78
Directly contrast the relationships:
df_merged_german %>%
plot_contrast(bins = 20)